1.  Introduction

[UC] introduced the concept of an ultracomputer and reviewed various basic
algorithms for such an ensemble of processors containing "shuffle" [Clos, Benes,
and Stone] interconnections. [UCN3] described a style for programming
ultracomputers and rewrote several of the basic algorithms in this new style.
This new style forbids recursion, so the recursive algorithms of [UC] appeared
in iterative form in [UCN3]. Although removing recursion raises no theoretical
obstacle, it creates obvious practical irritations. Nevertheless, we can allow
recursion within an ultracomputer programming language. Just as for
uniprocessors, the resulting recursive implementations of ultracomputer
algorithms may require more ultracomputer cycles and memory space, and may make
more synchronization requests, than non-recursive variants of the same
algorithm.

Thus  one  would  not  use  such  an  implementation  for  the  final  version  of  a 
production  program.  Nevertheless,  the  simplicity  of  the  recursive  form  of  an 
algorithm  will  sometimes  make  it  attractive.  In  particular,  since  ultracomputer 
algorithms  frequently  employ  a  divide  and  conquer  strategy,  the  use  of  recursion 
often  results  in  reduced  programming  effort  and  more  natural  code. 

Part 1 of this two-part report [PLUS1] introduced a (presently implemented)
ultracomputer simulator called PLUS, which uses the multitasking and
preprocessing features of PL/I [LRM] to support a recursive ultracomputer
programming style. Since the simulation is written in PL/I, the powerful
debugging features of the PL/I checkout compiler, as well as PL/I's separate
compilation facility, are available.

Due to the modular nature of PLUS's design, only a minor effort is needed
to reconfigure it to support interconnection schemes other than the
ultracomputer shuffle. In particular, the layered ultracomputer variant of [UC]
and the multidimensional variants of Harrison and Kalos (see [UCN6]) are easy
to simulate.

In part 1 we described the simulation system and furnished a "User's
Guide". In this note we discuss the system's implementation and prove both
correctness and freedom from deadlock.

Section II of this paper, which like the present section is largely taken from
[PLUS1], reviews the PLUS model of multiple processors and the synchronization
issues that emerge. Section III introduces the Gatekeeper module and
demonstrates that the system functions correctly and is deadlock free. Finally,
section IV presents some concluding remarks.

2.  Synchronization  Requirements 

As suggested in [UCN3], we suppose that all processors in the ultracomputer
will execute the same program. Note that this does not imply an SIMD
architecture, since conditional statements are permitted and the processors
execute asynchronously. Our basic idea is to write such programs as PL/I
procedures containing an additional parameter representing the processor
number. Then this procedure is invoked as a task, once for each processor. The
PL/I multitasking facility allows the multiple invocations of the procedure
thereby created to run "in parallel".
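As a rough illustration of this "one procedure, one task per processor"
arrangement, the following sketch uses Python threads in place of PL/I tasks.
It is not PLUS code: the names program and run_all and the machine size
MAX_PE = 7 are assumptions made only for this example.

```python
import threading

MAX_PE = 7  # assumed machine size: processors are numbered 0..MAX_PE


def program(n, log):
    """The single program run by every simulated processor; n is its number."""
    # A conditional on the processor number lets tasks follow different
    # paths, which is why this model is not SIMD.
    if n % 2 == 0:
        log[n] = "even"
    else:
        log[n] = "odd"


def run_all(proc):
    """Invoke proc once per processor, each invocation as its own thread."""
    log = [None] * (MAX_PE + 1)
    tasks = [threading.Thread(target=proc, args=(n, log))
             for n in range(MAX_PE + 1)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()  # the "main task" waits for every processor task to end
    return log
```

Each thread writes only its own slot of log, so no synchronization is needed
in this particular example.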

Communication between each processor and its four neighbors (via nearest
neighbor and shuffle connections) is handled using global arrays. Consider the
SUMMING procedure as an example and assume that the declaration

    DECLARE W(0:MAX_PE) FLOAT;
appears global to the procedure definition for SUMMING, where here, as
elsewhere, PLUS uses PE to abbreviate "processing element" or "processor". Then
the task corresponding to processor N will refer to W(N) for the value stored in
processor N and to W(RIGHT_PE(N)), W(LEFT_PE(N)), W(SHUFFLE_PE(N)), and
W(UNSHUFFLE_PE(N)) for the values stored in its four neighbors.
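The neighbor functions are not defined in this note (see [PLUS1] for the actual
ones). The sketch below shows one common convention, stated here purely as an
assumption: nearest neighbors are N+1 and N-1 with wraparound, and the shuffle
and unshuffle neighbors are obtained by rotating the D-bit binary representation
of N left or right by one position. The value D = 3 is likewise hypothetical.

```python
D = 3                   # assumed number of bits in a processor number
MAX_PE = (1 << D) - 1   # processors are numbered 0..MAX_PE


def right_pe(n):
    """Nearest neighbor n+1 (wraparound is an assumption)."""
    return (n + 1) & MAX_PE


def left_pe(n):
    """Nearest neighbor n-1 (wraparound is an assumption)."""
    return (n - 1) & MAX_PE


def shuffle_pe(n):
    """Rotate the D-bit representation of n left by one bit."""
    return ((n << 1) | (n >> (D - 1))) & MAX_PE


def unshuffle_pe(n):
    """Rotate the D-bit representation of n right by one bit."""
    return (n >> 1) | ((n & 1) << (D - 1))


# A global array indexed by processor number, as in DECLARE W(0:MAX_PE) FLOAT
W = [float(n) for n in range(MAX_PE + 1)]


def neighbor_values(n):
    """The four non-local values that task n may reference."""
    return (W[right_pe(n)], W[left_pe(n)],
            W[shuffle_pe(n)], W[unshuffle_pe(n)])
```

Under these assumed definitions shuffle_pe and unshuffle_pe are inverse
permutations of 0..MAX_PE.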

Appropriate synchronization is required to ensure that if one processor
references a variable stored in a logical neighbor, the value obtained is
current. This issue also appears in the model proposed in [UCN3], where a
conditional statement¹ can cause the processors to lose synchronization and
will therefore often end with a resynchronization request. Of course, it is
neither necessary nor desirable for the tasks constituting our simulation to be
in step at all times; as long as they are referencing only local variables,
they may run completely asynchronously. But non-local references require more
careful treatment. Consider once more the global declaration

    DECLARE W(0:MAX_PE) FLOAT;
and assume that for each N, processor N executes

    W(N) = W(N-1) + W(N);
Care is needed, since if (without the programmer intending this to be the case)
processor 3 updated W(3) prior to processor 4 referencing W(3), the value
assigned to W(4) would be incorrect.

¹With a condition depending on the processor number, as implied by the use of
such dictions as "each even numbered processor adds 1 to x".

Given the above global declaration we will say that task N owns the
component W(N). We consider W(N) to be local to task N and non-local to all
tasks M≠N. Our model forbids a task to update a non-local variable. We also
insist that whenever a task references a non-local variable, it does so using a
synchronization macro.
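To see why unsynchronized non-local references are hazardous, consider again
W(N) = W(N-1) + W(N). The sketch below compares the intended result, in which
every processor reads the old value of its neighbor, with what happens if the
processors merely happen to run in ascending order. The function names and the
sample data are illustrative assumptions, and processor 0 is assumed to perform
no update since it has no left operand.

```python
def simultaneous_update(w):
    """Intended semantics: every task reads old values, then all assign."""
    old = list(w)
    return [old[0]] + [old[n - 1] + old[n] for n in range(1, len(old))]


def ascending_order_update(w):
    """One unlucky schedule: task n runs only after task n-1 has updated."""
    w = list(w)
    for n in range(1, len(w)):
        w[n] = w[n - 1] + w[n]  # reads the already-updated W(N-1)
    return w


w = [1.0, 2.0, 3.0, 4.0, 5.0]
intended = simultaneous_update(w)   # [1.0, 3.0, 5.0, 7.0, 9.0]
racy = ascending_order_update(w)    # [1.0, 3.0, 6.0, 10.0, 15.0]
```

In the second schedule processor 3 has updated W(3) before processor 4 reads
it, so W(4) receives 15.0 rather than the intended 9.0.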

Should only a proper subset of the tasks require a non-local reference, these
tasks (called "snoopers" since they are to examine non-local data) execute the
macro SYNC_SET. This macro synchronizes all these processors and assigns the
non-local value being referenced to a local variable. The other tasks (called
"observers") execute the macro SYNC, which synchronizes them with the snoopers
but does no assignment. When a task executes one of these macros, that task
enters a wait state and remains in this state until all the tasks have begun
executing either SYNC_SET or SYNC. Eventually, all the tasks are waiting. At
this point, with the help of a software module called the Gatekeeper (described
in section III), each snooper is allowed to evaluate its non-local expression.
The tasks wait again, assuring that all the expressions are evaluated, and
finally the Gatekeeper allows them to proceed once more. The snoopers are free
to complete their assignments and each task may leave its macro.

We prove in section III that this scheme is deadlock free. Naturally, deadlock
may occur if the system is used incorrectly. If only a proper subset of the
tasks execute a macro, they will wait while the others proceed. Should these
latter tasks terminate, the system deadlocks. Thus another requirement is that
when one task synchronizes, they all do. All the above requirements can be
combined to yield:

Non-local  updates  are  forbidden.  When  non-local  references  are  required,  every 
task  executes  a  synchronization  macro.  The  snoopers  SYNC_SET  a  local  variable 
EQUAL_TO  a  non-local  expression.   The  observers  SYNC. 
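The rule can be illustrated with a short sketch that substitutes Python's
threading.Barrier for the PLUS macros. This is a swapped-in technique chosen
for brevity, not the Gatekeeper mechanism of section III; the helper names
sync_set and sync, and the values NMAX = 7 and K = 5, are assumptions.

```python
import threading

NMAX = 7   # assumed: processors are numbered 0..NMAX
K = 5      # assumed: tasks 1..K-1 are snoopers, the rest are observers


def run_rule_demo():
    w = [float(n + 1) for n in range(NMAX + 1)]
    barrier = threading.Barrier(NMAX + 1)   # reusable by all tasks

    def sync_set(n, expr):
        temp = expr()      # evaluate the non-local expression
        barrier.wait()     # every task has finished reading
        w[n] = temp        # update only the locally owned component
        barrier.wait()     # every update is complete

    def sync():
        barrier.wait()     # observers take part in both waits
        barrier.wait()     # but perform no assignment

    def task(n):
        if 0 < n < K:
            sync_set(n, lambda: w[n - 1] + w[n])
        else:
            sync()

    tasks = [threading.Thread(target=task, args=(n,))
             for n in range(NMAX + 1)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return w
```

Every task calls exactly one of the two helpers, which mirrors the requirement
that when one task synchronizes, they all do. If an observer skipped sync, the
snoopers would wait at the barrier indefinitely.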

3.  The SYNC_SET and SYNC Macros and the Gatekeeper

In this section we describe in detail the macros and the Gatekeeper
mentioned above and prove their correctness and freedom from deadlock.

A reader desiring to learn how to use the macro package is advised to see the
companion paper [PLUS1], which contains a user's guide. In order to improve the
exposition of the present paper, some liberties will be taken with the actual
user code. If, however, certain defaults are communicated to the macro package
(see [PLUS1] for details), the user code would be nearly identical to that
presented here.

Let there be NMAX+1 processors (a power of 2), numbered 0,...,NMAX, and again
consider

    W(N) = W(N-1) + W(N);

By the requirement given in the previous section, this statement must be
executed by task N, i.e. the task corresponding to the processor numbered N.
Let us assume that the above assignment is to occur only for 0 < N < K (this is
not the case for the SUMMING procedure from which this assignment statement was
taken). Thus tasks 1,...,K-1 are snoopers and tasks 0,K,K+1,...,NMAX are
observers.

The user writes

    IF (0 < N & N < K)
        THEN SYNC_SET (W) EQUAL_TO (W(N-1)+W(N));
        ELSE SYNC;

This expands into

    IF (0 < N & N < K)
        THEN DO;
            TEMP = W(N-1) + W(N);
            ss#1;
            WAIT (GATE1);
            W(N) = TEMP;
            ss#2;
            WAIT (GATE2);
        END;
        ELSE DO;
            ss#1;
            WAIT (GATE1);
            ss#2;
            WAIT (GATE2);
        END;

Both ss#1 and ss#2 consist of synchronization statements that are described
below. They have no direct effect on any of the user tasks. Instead, they
influence the Gatekeeper, an independent task not corresponding to any one
processor. The Gatekeeper in turn affects the user tasks by locking and
unlocking GATE1 and GATE2.

When execution begins GATE1 is locked and GATE2 is unlocked. The
Gatekeeper consists of the following.

    INFINITE_LOOP: DO WHILE (TRUE);
        WAIT (A);
        lock GATE2; unlock GATE1;
        WAIT (B);
        lock GATE1; unlock GATE2;
    END INFINITE_LOOP;

The execution of a realistic user program will result in each user task
generating P macro executions (the same number of executions for each user
task). This will cause the Gatekeeper to complete P loop iterations. First,
however, we consider a simpler case where the Gatekeeper consists of just the
body of the loop and each user task results in just one macro execution. We may
as well assume that the user program consists of just the above IF statement.
After we see that everything works in this simple case, the general case will
follow, since we will see that after "one time through" all the initial
conditions are restored so that an argument by induction is available.

3.1.   One  Time  Through 

The actual PL/I implementation of lock and unlock uses EVENT variables.
The only property that we need is that a task can pass through a gate if and
only if the gate is unlocked.

The execution sequence is as follows. Initially, each user task and the
Gatekeeper task are free to execute asynchronously. However, GATE1 is locked
and is not unlocked by the Gatekeeper until the Gatekeeper passes through
WAIT (A), where A is an array A(0),...,A(NMAX) of gates, each of which is
initially locked. Similarly, B consists of gates B(0),...,B(NMAX), initially
unlocked. The Gatekeeper can pass through WAIT (A) if all the A gates are
unlocked. At first glance it appears that the snoopers execute

    TEMP = W(N-1) + W(N);
and then the system deadlocks with the Gatekeeper stuck at WAIT (A) and all
the user tasks stuck at WAIT (GATE1). This is not the case, however, thanks to
ss#1, which actually is

    lock B(N); unlock A(N);

Our first result shows that no update occurs too early.

Proposition 1: All the snoopers complete their assignment to TEMP prior to any
snooper assigning to W(N).

Proof: No assignment to W(N) can occur prior to GATE1 being unlocked, and this
cannot happen until, for each N, task N executes ss#1, unlocking A(N). Each
snooper executes ss#1 only after its assignment to TEMP.

We now show that no update occurs too late. This requires ss#2, which is

    lock A(N); unlock B(N);

Proposition 2: All the snoopers complete their update before any user task
leaves its macro.

Proof: By the argument given in Proposition 1, no user task can get through
GATE1 before GATE2 and all the B's have been locked. Thus before any user task
leaves the macro, GATE2 must be unlocked. This happens when the Gatekeeper
terminates. But by the time the Gatekeeper passes through WAIT (A), all the B's
have been locked. Therefore, at WAIT (B), the Gatekeeper waits until all the
user tasks complete ss#2, which each snooper reaches only after its update.

The above proof also showed:

Proposition 3: The Gatekeeper terminates before any user task, and at this time
all the user tasks have finished ss#2. In particular all the A's are locked.

The key point that allows us to iterate the Gatekeeper and have many macros per
user task is the following simple

Corollary: The first task to terminate is the Gatekeeper. At that time all the
user tasks are executing WAIT (GATE2).² The gates are back to their initial
conditions.
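The restoration of the initial conditions can be checked by tracing the gate
states through one pass. The sketch below follows a single admissible ordering
of events (all ss#1, then the Gatekeeper, then all ss#2, then the Gatekeeper
again); it is an illustration rather than a proof, and the dictionary layout
and the value NMAX = 3 are assumptions.

```python
NMAX = 3  # assumed: user tasks are numbered 0..NMAX


def initial_gates():
    """Initial conditions from the text; True means 'unlocked'."""
    return {
        "GATE1": False,                 # GATE1 starts locked
        "GATE2": True,                  # GATE2 starts unlocked
        "A": [False] * (NMAX + 1),      # every A(N) starts locked
        "B": [True] * (NMAX + 1),       # every B(N) starts unlocked
    }


def ss1(g, n):
    g["B"][n] = False   # lock B(N)
    g["A"][n] = True    # unlock A(N)


def ss2(g, n):
    g["A"][n] = False   # lock A(N)
    g["B"][n] = True    # unlock B(N)


def one_time_through():
    g = initial_gates()
    for n in range(NMAX + 1):
        ss1(g, n)                       # each user task reaches WAIT (GATE1)
    assert all(g["A"])                  # so the Gatekeeper passes WAIT (A)
    g["GATE2"] = False                  # lock GATE2
    g["GATE1"] = True                   # unlock GATE1
    assert not any(g["B"])              # Gatekeeper must now wait at WAIT (B)
    for n in range(NMAX + 1):
        ss2(g, n)                       # each user task reaches WAIT (GATE2)
    assert all(g["B"])                  # so the Gatekeeper passes WAIT (B)
    g["GATE1"] = False                  # lock GATE1
    g["GATE2"] = True                   # unlock GATE2
    return g
```

The returned state equals initial_gates(), which is the property that makes the
induction over repeated macro executions available.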

3.2.  Infinite  Loops 

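Now assume that both the Gatekeeper and the user tasks are infinite loops
with loop body as in the previous subsection.

Proposition 4: The system is deadlock free.

Proof: If not, all the tasks are waiting at locked gates. Assume that the
Gatekeeper is at WAIT (A). Thus GATE2 is unlocked and the user tasks must be at
WAIT (GATE1). But then all the A's are unlocked, so the Gatekeeper is not
blocked, a contradiction. The case in which the Gatekeeper is at WAIT (B) is
symmetric: GATE1 is unlocked, the user tasks must be at WAIT (GATE2), and so
all the B's are unlocked.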
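For a small number of tasks, deadlock freedom can also be checked mechanically
by enumerating every reachable state of the gates and program counters. The
sketch below does this by breadth-first search. It assumes that each lock,
unlock, and wait is a separate atomic step, which is a modelling choice and not
something stated in the text; the state encoding and the with_ss1 switch (used
to reproduce the "at first glance" deadlock when ss#1 is omitted) are likewise
hypothetical.

```python
from collections import deque

# State: (pcs, gk, gate1, gate2, a, b); True means "unlocked".
# User pc:       0 lock B(N), 1 unlock A(N), 2 WAIT (GATE1),
#                3 lock A(N), 4 unlock B(N), 5 WAIT (GATE2), then 0 again.
# Gatekeeper pc: 0 WAIT (A), 1 lock GATE2, 2 unlock GATE1,
#                3 WAIT (B), 4 lock GATE1, 5 unlock GATE2, then 0 again.


def successors(state, with_ss1=True):
    """Yield every state reachable from `state` in one atomic step."""
    pcs, gk, gate1, gate2, a, b = state
    for n, pc in enumerate(pcs):
        na, nb = list(a), list(b)
        if pc == 0 and with_ss1:
            nb[n] = False
        elif pc == 1 and with_ss1:
            na[n] = True
        elif pc == 2 and not gate1:
            continue                    # blocked at WAIT (GATE1)
        elif pc == 3:
            na[n] = False
        elif pc == 4:
            nb[n] = True
        elif pc == 5 and not gate2:
            continue                    # blocked at WAIT (GATE2)
        new_pcs = pcs[:n] + ((pc + 1) % 6,) + pcs[n + 1:]
        yield (new_pcs, gk, gate1, gate2, tuple(na), tuple(nb))
    if (gk == 0 and not all(a)) or (gk == 3 and not all(b)):
        return                          # Gatekeeper blocked at WAIT (A) or (B)
    if gk == 1:
        gate2 = False
    elif gk == 2:
        gate1 = True
    elif gk == 4:
        gate1 = False
    elif gk == 5:
        gate2 = True
    yield (pcs, (gk + 1) % 6, gate1, gate2, a, b)


def find_deadlock(n_tasks, with_ss1=True):
    """Return a reachable state with no enabled step, or None if none exists."""
    start = ((0,) * n_tasks, 0, False, True,
             (False,) * n_tasks, (True,) * n_tasks)
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        nxt = list(successors(state, with_ss1))
        if not nxt:
            return state
        for s in nxt:
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return None
```

Calling find_deadlock(2) or find_deadlock(3) explores the whole reachable state
space of this model. With with_ss1=False the search reports the state in which
every user task sits at WAIT (GATE1) and the Gatekeeper at WAIT (A).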

3.3.  General  Case 

Now  we  consider  the  actual  situation  when  the  Gatekeeper  is  an  infinite  loop 
and  each  user  task  executes  an  arbitrary  number,  P,  of  macros  (all  user  tasks 
must  execute  the  same  number  of  macros). 

In the Corollary to Proposition 3, we have already seen that the first
execution results in the Gatekeeper completing one iteration, the gates being
restored to initial condition, and the user tasks reaching WAIT (GATE2). All
the Gatekeeper can do is WAIT (A). Thus the user tasks are free to leave the
macro.

When one user task leaves its macro, it will eventually enter another macro
(or else terminate). Note that the gates are back in the initial positions.
Thus, as we have already seen, this user task will remain at WAIT (GATE1) until
all the user tasks leave the first macro, enter the second, and reach
WAIT (GATE1). Therefore, the procedure seen in "one time through" repeats.

Since deadlock is impossible by Proposition 4, a user task will eventually
terminate (assuming that each user task has a finite amount of work to do).
Each of the other user tasks must have finished its last macro (recall each
user task executes P macros) or is in its last macro waiting at an unlocked
GATE2. Thus all the user tasks terminate. At this point the Gatekeeper is still
stuck at WAIT (A).

Now that these user tasks have ended, the main task awakens from its long
slumber and terminates both itself and the Gatekeeper. See [PLUS1] for the
actual PL/I implementation. In addition to proving termination, we have shown:

When any user task is between macro executions, the values of the global
variables are correct (i.e. as set in the last macro execution).

²More exactly, they have finished the previous statement; there may be a time
lag between statements.

In [PLUS1] we illustrated the use of these macros by implementing several
ultracomputer algorithms from [UC]. The results of the present paper are needed
to show the correctness of those implementations.
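The scheme of this section can be exercised end to end with threads. The
sketch below models each gate with a Python threading.Event (set meaning
unlocked), runs the Gatekeeper as a daemon thread so that it ends when the main
program does, and has every user task execute P macros. It is an approximation
of the mechanism described here, not the PL/I code of [PLUS1]; the names Gate,
simulate, and reference are assumptions.

```python
import threading


class Gate:
    """A gate that a task can pass only while it is unlocked."""

    def __init__(self, unlocked):
        self._event = threading.Event()
        if unlocked:
            self._event.set()

    def lock(self):
        self._event.clear()

    def unlock(self):
        self._event.set()

    def wait(self):
        self._event.wait()


def simulate(nmax, k, p, w0):
    """Run P macro executions of W(N) = W(N-1) + W(N) for 0 < N < K."""
    w = list(w0)
    gate1, gate2 = Gate(False), Gate(True)
    a = [Gate(False) for _ in range(nmax + 1)]
    b = [Gate(True) for _ in range(nmax + 1)]

    def gatekeeper():
        while True:
            for g in a:
                g.wait()                # WAIT (A)
            gate2.lock()
            gate1.unlock()
            for g in b:
                g.wait()                # WAIT (B)
            gate1.lock()
            gate2.unlock()

    def user(n):
        snooper = 0 < n < k
        for _ in range(p):
            if snooper:
                temp = w[n - 1] + w[n]
            b[n].lock()                 # ss#1
            a[n].unlock()
            gate1.wait()
            if snooper:
                w[n] = temp
            a[n].lock()                 # ss#2
            b[n].unlock()
            gate2.wait()

    threading.Thread(target=gatekeeper, daemon=True).start()
    users = [threading.Thread(target=user, args=(n,)) for n in range(nmax + 1)]
    for t in users:
        t.start()
    for t in users:
        t.join()                        # the main task waits for user tasks
    return w


def reference(k, p, w0):
    """Sequential model of the same P simultaneous updates."""
    w = list(w0)
    for _ in range(p):
        old = list(w)
        for n in range(1, k):
            w[n] = old[n - 1] + old[n]
    return w
```

For example, simulate(7, 5, 3, [1.0] * 8) should agree with
reference(5, 3, [1.0] * 8) on every run, whatever order the threads are
scheduled in.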

4.  Conclusion  and  Future  Work 

The conclusion of this paper and its companion [PLUS1] is simply that, with
a possible sacrifice of efficiency, it is a reasonably straightforward
procedure to implement ultracomputer algorithms in PL/I.

An important question to answer concerns the inherent inefficiency in an
ultracomputer implementation. That is, in a hardware ultracomputer, how long
would the processors be waiting? For example, if one processor needs a value
that is calculated by another processor, the first processor may have to wait
for the second.

MULT, a joint design with Clyde Kruskal, is a system based on message
passing that will calculate the processor waiting times we have just described.
This system has been implemented with the assistance of Jay Klein and is the
subject of another report [UCN15].